I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
Images are either directly linked, or generated with Stable Diffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyrighted materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
You must ensure that the content is not used for further training of the model
band_members
# A tibble: 3 × 2
  name  band
  <chr> <chr>
1 Mick  Stones
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays
  <chr> <chr>
1 John  guitar
2 Paul  bass
3 Keith guitar
Inner join
With an inner join, we combine two data frames based on a common key. Only the rows with matching keys in both data frames are kept.
band_members %>% inner_join(band_instruments)
# A tibble: 2 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 John  Beatles guitar
2 Paul  Beatles bass
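The call above lets dplyr guess the key from the shared column name. As a minimal sketch, the same join can be written with the key spelled out through the `by` argument; `band_members` and `band_instruments` are the example data sets that ship with dplyr, and the object name `inner` is only illustrative.

```r
library(dplyr)

# Inner join with an explicit key: only names present in BOTH tables survive
inner <- band_members %>%
  inner_join(band_instruments, by = "name")

inner        # John and Paul; Mick and Keith are dropped
nrow(inner)  # number of matching keys
```

Naming the key explicitly also silences the "Joining with `by = ...`" message that dplyr prints when it has to infer the key.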
Left join
With a left join, we keep all rows from the left data frame and only the matching rows from the right data frame. If there is no match, the result will contain NA for the columns from the right data frame.
band_members %>% left_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  <NA>
2 John  Beatles guitar
3 Paul  Beatles bass
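A left join is a handy way to find out which rows of the left table have no partner in the right table. The sketch below assumes the dplyr example data and uses the illustrative object names `left` and `unmatched`.

```r
library(dplyr)

# Keep every band member; instruments are filled in where a match exists
left <- band_members %>%
  left_join(band_instruments, by = "name")

# Rows that found no match carry NA in the right-hand column `plays`
unmatched <- left %>%
  filter(is.na(plays))

unmatched
```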
Right join
With a right join, we keep all rows from the right data frame and only the matching rows from the left data frame. If there is no match, the result will contain NA for the columns from the left data frame.
band_members %>% right_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 John  Beatles guitar
2 Paul  Beatles bass
3 Keith <NA>    guitar
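A right join can be read as a left join with the two tables swapped. The sketch below checks this on the dplyr example data; the object names `right` and `swapped` are illustrative, and the `select()` call is only there to put the columns back in the same order.

```r
library(dplyr)

# Right join: all rows of band_instruments are kept
right <- band_members %>%
  right_join(band_instruments, by = "name")

# Same information via a left join with the tables swapped
swapped <- band_instruments %>%
  left_join(band_members, by = "name") %>%
  select(name, band, plays)  # restore the column order of `right`

right
swapped
```

For these two small tables the rows also come out in the same order; in general, row order may differ between the two formulations.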
Full join
With a full join, we keep all rows from both data frames. If there is no match, the result will contain NA for the columns from the other data frame.
band_members %>% full_join(band_instruments)
# A tibble: 4 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  <NA>
2 John  Beatles guitar
3 Paul  Beatles bass
4 Keith <NA>    guitar
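Because a full join keeps unmatched rows from both sides, missing values can show up in the columns of either table. A minimal sketch on the dplyr example data, counting the NAs per column (the names `full` and `na_counts` are illustrative):

```r
library(dplyr)

# Full join: every name from either table appears once
full <- band_members %>%
  full_join(band_instruments, by = "name")

# Count missing values per column; the key column itself has none
na_counts <- colSums(is.na(full))
na_counts
```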
The tidyverse packages
tidyverse and the data analysis cycle
Tidyverse and the verbs of data manipulation
Leading principle of the tidyverse: a programming language should really behave like a language.
tidyverse: a few key verbs that perform common types of data manipulation.
Tidy data
The tidyverse packages operate on tidy data:
Each column is a variable
Each row is an observation
Each cell is a single value
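To make the three rules concrete, here is a small sketch that reshapes an untidy table into a tidy one with `tidyr::pivot_longer()`. The data are made up for illustration: `untidy`, `student`, `test_1`, `test_2`, `test` and `score` are hypothetical names, not part of the course data.

```r
library(dplyr)
library(tidyr)

# Untidy: the variable "test" is spread over two column names
untidy <- tibble(
  student = c("A", "B"),
  test_1  = c(7, 5),
  test_2  = c(8, 6)
)

# Tidy: one column per variable, one row per student-test observation
tidy <- untidy %>%
  pivot_longer(
    cols      = c(test_1, test_2),
    names_to  = "test",
    values_to = "score"
  )

tidy
```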
Untidy versus tidy data
The dplyr package
Data manipulation with dplyr
The dplyr package is a specialized package for working with data frames (and the related tibbles) to transform and summarize tabular data:
summary statistics for grouped data
selecting variables
filtering cases
(re)arranging cases
computing new variables
recoding variables
dplyr cheatsheet
Common dplyr functions
There are many functions available in dplyr, but we will focus on just the following dplyr functions (verbs):
dplyr verb     Description
glimpse()      a transposed print of the data that shows all variables
select()       selects variables (columns) based on their names
filter()       subsets the rows of a data frame based on their values
arrange()      reorders (arranges) the rows
mutate()       adds new variables that are functions of existing variables
summarise()    creates a new data frame with summary statistics of the variables (optionally grouped by other variables)
group_by()     allows for group operations in the “split-apply-combine” concept
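The verbs are designed to be chained with the pipe. The sketch below strings them together on `mtcars`, a data set built into R that is used here only as a stand-in; the derived names `wt_kg`, `mean_mpg`, `mean_wt_kg`, `n` and `result` are illustrative.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt) %>%             # keep three columns
  filter(wt < 4) %>%                   # drop the heaviest cars
  mutate(wt_kg = wt * 453.592) %>%     # wt is recorded in units of 1000 lbs
  group_by(cyl) %>%                    # split by number of cylinders
  summarise(
    mean_mpg   = mean(mpg),            # apply: one summary row per group
    mean_wt_kg = mean(wt_kg),
    n          = n()
  ) %>%
  arrange(desc(mean_mpg))              # most economical group first

glimpse(result)
```

Reading the pipeline top to bottom gives the "split-apply-combine" pattern: `group_by()` splits, `summarise()` applies and combines.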
Select factor variables with where(is.factor):
planets %>% select(where(is.factor))
               planet_type
Mercury Terrestrial planet
Venus   Terrestrial planet
Earth   Terrestrial planet
Mars    Terrestrial planet
Jupiter          Gas giant
Saturn           Gas giant
Uranus           Gas giant
Neptune          Gas giant
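The `where()` helper accepts any predicate function that returns TRUE or FALSE for a column. Since the `planets` data are not reproduced here, the sketch below uses the built-in `iris` data as a stand-in to show both a numeric and a factor selection; the object names are illustrative.

```r
library(dplyr)

# Columns for which is.numeric() returns TRUE
numeric_cols <- iris %>%
  select(where(is.numeric))

# Columns for which is.factor() returns TRUE
factor_cols <- iris %>%
  select(where(is.factor))

names(numeric_cols)
names(factor_cols)
```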
Select rows with dplyr::filter()
Selects subsets of the rows of a data frame based on their values.
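As a minimal sketch of how `filter()` behaves, using the `band_members` example data that ship with dplyr (the object names `beatles` and `john` are illustrative):

```r
library(dplyr)

# One condition: keep the rows where band equals "Beatles"
beatles <- band_members %>%
  filter(band == "Beatles")

# Several conditions separated by commas are combined with AND
john <- band_members %>%
  filter(band == "Beatles", name == "John")

beatles
john
```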
Select the planets that have a ring and that are gas giants: